Analysis of the World Happiness dataset from Kaggle: https://www.kaggle.com/unsdsn/world-happiness
Objectives: evaluate happiness levels across countries; evaluate how happiness evolved over the years; identify the most and least happy countries; identify the factors correlated with happiness.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
df = pd.read_csv("2019.csv")
df.head()
df.shape
there are 156 rows and 9 columns
df.columns
list of columns
df.isna().sum()
there are no missing values
df[["Country or region", "Score"]].sort_values(by= "Score", ascending=False).head(10)
The 10 countries with the highest happiness scores. 8 of the 10 are located in Europe.
df[["Country or region", "Score"]].sort_values(by= "Score", ascending=True).head(10)
The 10 countries with the lowest happiness scores. The majority are located in Africa.
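The two rankings above use a sort followed by `head(10)`. As a minimal sketch of an equivalent shortcut, pandas also offers `nlargest` and `nsmallest`. The toy frame below is a stand-in for `2019.csv`, and its scores are made up for illustration.

```python
import pandas as pd

# Toy data standing in for 2019.csv; the scores are invented for illustration.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D", "E"],
    "Score": [7.1, 4.2, 6.3, 3.5, 5.0],
})

# nlargest/nsmallest return the n rows with the highest/lowest values,
# already ordered, so no explicit sort_values + head is needed.
happiest = toy.nlargest(2, "Score")
least_happy = toy.nsmallest(2, "Score")

print(happiest)
print(least_happy)
```

On the real data the call would be `df.nlargest(10, "Score")`, assuming the column names used in this notebook.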
df[["Country or region", "GDP per capita"]].sort_values(by= "GDP per capita", ascending=False).head(10)
top 10 countries by GDP per capita
df[["Country or region", "Perceptions of corruption"]].sort_values(by= "Perceptions of corruption", ascending=False).head(10)
top 10 countries by Perceptions of corruption
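# numeric_only=True skips the text column "Country or region", which recent pandas versions refuse to correlate
df.corr(numeric_only=True)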
The "Score" variable is strongly positively correlated with "GDP per capita", "Social support" and "Healthy life expectancy". It is also positively correlated, though less strongly, with "Freedom to make life choices". There is a very strong positive correlation between "GDP per capita" and "Healthy life expectancy".
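To read the correlations with the score at a glance, one option is to pull the "Score" column out of the correlation matrix and sort it. This is only a sketch on invented numbers; the column names follow the 2019 file, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Toy data; values are invented and only mimic the shape of 2019.csv.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D", "E"],
    "Score": [7.0, 6.0, 5.0, 4.0, 3.0],
    "GDP per capita": [1.4, 1.2, 1.0, 0.7, 0.3],
    "Generosity": [0.2, 0.1, 0.3, 0.2, 0.2],
})

# Pearson correlation of every numeric column with Score, strongest first.
score_corr = (
    toy.corr(numeric_only=True)["Score"]
    .drop("Score")              # the correlation of Score with itself is always 1
    .sort_values(ascending=False)
)
print(score_corr)
```

Correlation describes association only; it does not show that a factor causes higher scores.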
sns.relplot(x ="GDP per capita", y = "Score", data=df)
sns.relplot(x ="Social support", y = "Score", data=df)
sns.relplot(x ="Healthy life expectancy", y = "Score", data=df)
sns.relplot(x ="Perceptions of corruption", y = "Score", data=df)
plt.title("Distribution of happiness scores")
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(df["Score"], fill=True)
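df.kurtosis(numeric_only=True)
Excess kurtosis of each numeric column (pandas reports 0 for a normal distribution).
df.skew(numeric_only=True)
Skewness of each numeric column (positive values indicate a longer right tail).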
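The two statistics are easier to compare side by side. A small sketch, again on invented data, that collects them into one table:

```python
import pandas as pd

# Toy data: Score is symmetric around 5, Generosity has one large outlier.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "C", "D", "E", "F"],
    "Score": [3.0, 4.0, 5.0, 5.0, 6.0, 7.0],
    "Generosity": [0.1, 0.1, 0.1, 0.2, 0.2, 0.9],
})

numeric = toy.select_dtypes("number")

# One row per variable, one column per shape statistic.
shape_stats = pd.DataFrame({
    "skew": numeric.skew(),
    "kurtosis": numeric.kurt(),  # excess kurtosis (Fisher definition)
})
print(shape_stats)
```

In this toy example the symmetric column has skewness 0, while the column with the outlier is right-skewed.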
#PCA analysis (dimensionality reduction)
df1 = pd.read_csv("happinesscode.csv", delimiter=";")
This loads a version of the 2019 dataset that also contains ISO country codes, which the map below needs.
fig = px.choropleth(
    df1,
    locations="Country Code",
    color="Score",
    hover_name="Country or region",
    color_continuous_scale=px.colors.sequential.Plasma,
    title="Happiness Score 2019",
)
fig.show()
df2015 = pd.read_csv("2015.csv")
df2016 = pd.read_csv("2016.csv")
df2017 = pd.read_csv("2017.csv")
df2018 = pd.read_csv("2018.csv")
df2019 = pd.read_csv("2019.csv")
df2015["Year"] = 2015
df2016["Year"] = 2016
df2017["Year"] = 2017
df2018["Year"] = 2018
df2019["Year"] = 2019
df2015.rename(columns={"Country": "Country or region", "Happiness Rank": "Overall rank", "Happiness Score": "Score", "Economy (GDP per Capita)":"GDP per capita", "Family": "Social support", "Health (Life Expectancy)": "Healthy life expectancy", "Freedom": "Freedom to make life choices", "Trust (Government Corruption)": "Perceptions of corruption"}, inplace=True)
df2015.drop(columns=["Region", "Standard Error", "Dystopia Residual"], inplace=True)
df2016.rename(columns={"Country": "Country or region", "Happiness Rank": "Overall rank", "Happiness Score": "Score", "Economy (GDP per Capita)":"GDP per capita", "Family": "Social support", "Health (Life Expectancy)": "Healthy life expectancy", "Freedom": "Freedom to make life choices", "Trust (Government Corruption)": "Perceptions of corruption"}, inplace=True)
df2016.drop(columns=["Region", "Dystopia Residual", "Lower Confidence Interval", "Upper Confidence Interval"], inplace=True)
df2017.rename(columns={"Country": "Country or region", "Happiness.Rank": "Overall rank", "Happiness.Score": "Score", "Economy..GDP.per.Capita.":"GDP per capita", "Family": "Social support", "Health..Life.Expectancy.": "Healthy life expectancy", "Freedom": "Freedom to make life choices", "Trust..Government.Corruption.": "Perceptions of corruption"}, inplace=True)
df2017.drop(columns=["Dystopia.Residual", "Whisker.low", "Whisker.high"], inplace=True)
# Keep the same columns, in the same order, for every year
common_columns = ["Overall rank", "Country or region", "Score", "GDP per capita", "Social support", "Healthy life expectancy", "Freedom to make life choices", "Generosity", "Perceptions of corruption", "Year"]
df2015 = df2015[common_columns]
df2016 = df2016[common_columns]
df2017 = df2017[common_columns]
df2018 = df2018[common_columns]
df2019 = df2019[common_columns]
df_all_years = pd.concat([df2015,df2016,df2017,df2018,df2019])
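df_all_years.to_excel("happiness_all_years.xlsx")
# The CSV read below is not written by this notebook; it was presumably derived from the
# Excel export, with a "Country Code" column added in a spreadsheet ("#N/D" marks failed lookups)
df_all_years = pd.read_csv("happiness_all_years.csv", delimiter=";", na_values=["#N/D"])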
df_all_years.drop(columns="Unnamed: 0", inplace=True)
df_all_years.dropna(inplace=True)
df_all_years = df_all_years[df_all_years.Year != 2017]
df_all_years["Score"] = df_all_years.Score.astype(float)
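The "Country Code" column used by the animated map is not produced by any code shown here. One way to attach it inside the notebook, sketched under assumptions, is a left merge against a lookup table such as `df1`, which holds both "Country or region" and "Country Code". The frames and scores below are invented stand-ins, and "Atlantis" is a deliberately unmatched name.

```python
import pandas as pd

# Stand-in for the concatenated multi-year table (scores are invented).
scores = pd.DataFrame({
    "Country or region": ["Finland", "Denmark", "Atlantis"],
    "Score": [7.5, 7.4, 5.0],
    "Year": [2019, 2019, 2019],
})

# Stand-in for the lookup table with ISO codes (df1 in this notebook).
codes = pd.DataFrame({
    "Country or region": ["Finland", "Denmark"],
    "Country Code": ["FIN", "DNK"],
})

# A left merge keeps every score row; names without a code get NaN.
merged = scores.merge(codes, on="Country or region", how="left")

# Inspect the rows that failed to match before dropping them.
unmatched = merged[merged["Country Code"].isna()]
merged = merged.dropna(subset=["Country Code"])

print(unmatched["Country or region"].tolist())
print(merged)
```

Checking `unmatched` first matters because country names are spelled differently across the yearly files, so a silent `dropna` can remove countries from some years only.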
px.choropleth(
    df_all_years,
    locations="Country Code",
    color="Score",
    hover_name="Country or region",
    animation_frame="Year",
    color_continuous_scale="Plasma",
    height=500,
)
The animated map suggests an overall decrease in happiness scores between 2015 and 2019 (2017 is excluded from this view). Judging from the map alone, countries appear to have become less happy over this period.
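A visual impression from a map is hard to verify, so a numeric summary per year is a useful cross-check. The sketch below uses invented scores for two hypothetical countries; on the real data the same `groupby` would run on `df_all_years`.

```python
import pandas as pd

# Invented scores for two hypothetical countries in two years.
toy = pd.DataFrame({
    "Country or region": ["A", "B", "A", "B"],
    "Year": [2015, 2015, 2019, 2019],
    "Score": [6.0, 4.0, 5.8, 4.4],
})

# Mean, median and number of countries per year.
yearly = toy.groupby("Year")["Score"].agg(["mean", "median", "count"])

# Change in the mean score between the first and the last year.
change = yearly["mean"].iloc[-1] - yearly["mean"].iloc[0]

print(yearly)
print(f"change in mean score: {change:+.2f}")
```

The `count` column is worth checking: if the set of countries differs between years, a change in the mean may reflect the sample rather than a real trend.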
#PCA ANALYSIS
The first three principal components explain 83% of the variability.
Interpretation of the first three principal components: the first is mainly an average of the variables. The second contrasts freedom to make life choices, generosity and perceptions of corruption with the other three variables. The third contrasts perceptions of corruption with generosity and social support.
Result of the PCA, first principal component (x axis): on the right we find countries with high values for all the variables (apart from generosity); on the left we find countries with low values for almost all the variables.
Result of the PCA, second principal component (y axis): at the bottom we find countries with high values for generosity, perceptions of corruption and freedom to make life choices relative to the other variables; at the top we find countries with high values for GDP per capita, social support and healthy life expectancy relative to the other variables.
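The code behind the PCA results is not included above. The following is a minimal sketch of how the explained variance, the loadings and the country scores could be obtained, assuming the six factor columns are standardized first and using a plain NumPy SVD instead of whatever tool was originally used. The data is random and only stands in for the real table, so the numbers it prints do not reproduce the 83% figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 "countries" x 6 "factors" (not the real data).
X = rng.normal(size=(50, 6))
X[:, 1] += 0.8 * X[:, 0]  # make two columns correlated, as real factors are

# Standardize each column so that no variable dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized data.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of total variance carried by each component, and its running total.
explained_variance_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_variance_ratio)

# Loadings: one column per component, one row per original variable.
loadings = Vt.T

# Scores: coordinates of each row (country) on the principal components.
scores = Z @ loadings

print("cumulative explained variance:", np.round(cumulative, 3))
print("loadings of the first component:", np.round(loadings[:, 0], 3))
```

The signs of a component's loadings show which variables it contrasts, and plotting the first two columns of `scores` gives the x/y country map described above. The sign of each component is arbitrary, so left/right or top/bottom may be flipped between runs or tools.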